DataCamp offers interactive courses on R programming. While some of the material is review, it is helpful to see other perspectives on it. DataCamp also has interesting material on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:
This document is currently split between _v003 and _v003_a and _v003_b to keep the number of DLLs each document loads below R's hard-coded maximum. This introductory section needs to be rewritten, and the contents consolidated, at a future date.
The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents were split to produce this document:
Chapter 1 - Introduction
Problems in spatial statistics:
Simulation and testing with spatstat:
Further testing:
Example code includes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# The number of points to create
n <- 200
# Set the range
xmin <- 0
xmax <- 1
ymin <- 0
ymax <- 2
# Sample from a Uniform distribution
x <- runif(n, xmin, xmax)
y <- runif(n, ymin, ymax)
# The ratio of the Y axis scale to the X axis scale is called the aspect ratio of the plot. Spatial data should always be presented with an aspect ratio of 1:1.
# See pre-defined variables
# ls.str()
# Plot points and a rectangle
mapxy <- function(a = NA) {
  plot(x, y, asp = a)
  rect(xmin, ymin, xmax, ymax)
}
mapxy(1)
# How do we create a uniform-density point pattern in a circle?
# We might first try sampling the radius and angle uniformly. But that produces a "cluster" of points near the center
# Instead we sample the radius from a distribution whose density scales linearly with distance, so we have fewer points at small radii and more at large radii
# This exercise uses spatstat's disc() function, which creates a circular window.
# Load the spatstat package
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: rpart
##
## spatstat 1.55-0 (nickname: 'Stunned Mullet')
## For an introduction to spatstat, type 'beginner'
##
## Note: R version 3.3.3 (2017-03-06) is more than 9 months old; we strongly recommend upgrading to the latest version
# Create this many points, in a circle of this radius
n_points <- 300
radius <- 10
# Generate uniform random numbers up to radius-squared
r_squared <- runif(n_points, 0, radius**2)
angle <- runif(n_points, 0, 2*pi)
# Take the square root of the values to get a uniform spatial distribution
x <- sqrt(r_squared) * cos(angle)
y <- sqrt(r_squared) * sin(angle)
plot(spatstat::disc(radius))
points(x, y)
# Some variables have been pre-defined
# ls.str()
# Set coordinates and window
ppxy <- ppp(x = x, y = y, window = disc(radius))
# Test the point pattern
qt <- quadrat.test(ppxy)
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
# Inspect the results
plot(qt)
print(qt)
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: ppxy
## X2 = 33.064, df = 24, p-value = 0.2055
## alternative hypothesis: two.sided
##
## Quadrats: 25 tiles (irregular windows)
# In the previous exercise you used a set of 300 events scattered uniformly within a circle
# If you repeat the generation of the events, you will still have exactly 300 of them, but in different locations
# A dataset of exactly 300 points comes from a Poisson point process conditioned on the total being 300
# The spatstat package can generate Poisson spatial processes with the rpoispp() function, given an intensity and a window; these are not conditioned on the total
# Just as the random number generator functions in R start with an "r", most of the random point-pattern functions in spatstat start with an "r".
# The area() function of spatstat will compute the area of a window such as a disc
# Create a disc of radius 10
disc10 <- disc(10)
# Compute the rate as count divided by area
lambda <- 500 / area(disc10)
# Create a point pattern object
ppois <- rpoispp(lambda = lambda, win = disc10)
# Plot the Poisson point pattern
plot(ppois)
# The spatstat package also has functions for generating point patterns from other process models.
# These generally fall into one of two classes: clustered processes, where points occur together more than under a uniform Poisson process,
# and regular (aka inhibitory) processes where points are more spaced apart than under a uniform intensity Poisson process
# Some process models can generate patterns on a continuum from clustered through uniform to regular depending on their parameters
# The quadrat.test() function can test against clustered or regular alternative hypotheses
# By default it tests against either of those, but this can be changed with the alternative parameter to create a one-sided test.
# A Thomas process is a clustered pattern where a number of "parent" points, uniformly distributed, create a number of "child" points in their neighborhood
# The child points themselves form the pattern. This is an attractive point pattern, and makes sense for modelling things like trees, since new trees will grow near the original tree
# Random Thomas point patterns can be generated using rThomas()
# This takes three numbers that determine the intensity and clustering of the points, and a window object.
# Conversely, each point of a Strauss process lowers the probability of finding another point nearby
# The parameters of a Strauss process can be such that it is a "hard-core" process, where no two points can be closer than a set threshold
# Creating points from this process involves some clever simulation algorithms
# This is a repulsive point pattern, and makes sense for modelling things like territorial animals, since the other animals of that species will avoid the territory of a given animal
# Random Strauss point patterns can be generated using rStrauss()
# This takes three numbers that determine the intensity and "territory" of the points, and a window object
# Points generated by a Strauss process are sometimes called regularly spaced.
# Create a disc of radius 10
disc10 <- disc(10)
# Generate clustered points from a Thomas process
set.seed(123)
p_cluster <- rThomas(kappa = 0.35, scale = 1, mu = 3, win = disc10)
plot(p_cluster)
# Run a quadrat test
quadrat.test(p_cluster, alternative = "clustered")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: p_cluster
## X2 = 53.387, df = 24, p-value = 0.0005142
## alternative hypothesis: clustered
##
## Quadrats: 25 tiles (irregular windows)
# Regular points from a Strauss process
set.seed(123)
p_regular <- rStrauss(beta = 2.9, gamma = 0.025, R = .5, W = disc10)
## Warning: Simulation will be performed in the containing rectangle and
## clipped to the original window.
plot(p_regular)
# Run a quadrat test
quadrat.test(p_regular, alternative = "regular")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: p_regular
## X2 = 8.57, df = 24, p-value = 0.001619
## alternative hypothesis: regular
##
## Quadrats: 25 tiles (irregular windows)
# Another way of assessing clustering and regularity is to consider each point, and how it relates to the other points
# One simple measure is the distribution of the distances from each point to its nearest neighbor
# The nndist() function in spatstat takes a point pattern and for each point returns the distance to its nearest neighbor
# Instead of working with the nearest-neighbor density, as seen in the histogram, it can be easier to work with the cumulative distribution function, G(r)
# This is the probability of a point having a nearest neighbor within a distance r
# For a uniform Poisson process, G can be computed theoretically, and is G(r) = 1 - exp( - lambda * pi * r ^ 2)
# You can compute G empirically from your data using Gest() and so compare with the theoretical value.
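As a quick sketch, the theoretical G curve can be computed directly from the formula above. The intensity here reuses the earlier 500-points-in-a-disc-of-radius-10 setup as an assumption; it is an illustration, not part of the course code:

```r
# Theoretical G(r) for a uniform Poisson process: G(r) = 1 - exp(-lambda * pi * r^2)
lambda <- 500 / (pi * 10^2)      # assumed intensity: 500 points in a disc of radius 10
r <- seq(0, 1, by = 0.01)
G_theory <- 1 - exp(-lambda * pi * r^2)
plot(r, G_theory, type = "l", ylab = "G(r)")
```

This is the same theoretical curve that plot(Gest(...)) draws alongside the edge-corrected empirical estimates.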
# Events near the edge of the window may have their true nearest neighbor outside the window, where it goes unobserved
# This makes the distance to the observed nearest neighbor larger than expected, biasing the estimate of G
# There are several methods for correcting this bias
# Plotting the output from Gest shows the theoretical cumulative distribution and several estimates of the cumulative distribution using different edge corrections
# Often these edge corrections are almost indistinguishable, and the lines overlap
# The plot can be used as a quick exploratory test of complete spatial randomness
# Two ppp objects, p_poisson, and p_regular are defined for you
# Point patterns are pre-defined
p_poisson <- ppois
p_poisson
## Planar point pattern: 510 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Calc nearest-neighbor distances for Poisson point data
nnd_poisson <- nndist(p_poisson)
# Draw a histogram of nearest-neighbor distances
hist(nnd_poisson)
# Estimate G(r)
G_poisson <- Gest(p_poisson)
# Plot G(r) vs. r
plot(G_poisson)
# Repeat for regular point data
nnd_regular <- nndist(p_regular)
hist(nnd_regular)
G_regular <- Gest(p_regular)
plot(G_regular)
# A number of other functions of point patterns have been developed
# They are conventionally denoted by various capital letters, including F, H, J, K and L
# The K-function is defined as the expected number of points within a distance r of a typical point of the process, scaled by the intensity
# Like G, this can be computed theoretically for a uniform Poisson process: K(r) = pi * r ^ 2, the area of a circle of radius r
# Deviation from pi * r ^ 2 can indicate clustering or point inhibition
# Computational estimates of K(r) are done using the Kest() function.
# As with G calculations, K-function calculations also need edge corrections
# The default edge correction in spatstat is generally the best, but can be slow, so we'll use the "border" correction for speed here
# Uncertainties on K-function estimates can be assessed by randomly sampling points from a uniform Poisson process in the area and computing the K-function of the simulated data
# Repeat this process 99 times, and take the minimum and maximum value of K over each of the distance values
# This gives an envelope - if the K-function from the data goes above the top of the envelope then we have evidence for clustering
# If the K-function goes below the envelope then there is evidence for an inhibitory process causing points to be spaced out
# Envelopes can be computed using the envelope() function
# The plot method for estimates of K uses a formula system where a dot on the left of a formula refers to K(r)
# So the default plot uses . ~ r
# You can compare the estimate of K to a Poisson process by plotting . - pi * r ^ 2 ~ r
# If the data was generated by a Poisson process, then the line should be close to zero for all values of r
# Point patterns are pre-defined
p_poisson
## Planar point pattern: 510 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_cluster
## Planar point pattern: 332 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Estimate the K-function for the Poisson points
K_poisson <- Kest(p_poisson, correction = "border")
# The default plot shows quadratic growth
plot(K_poisson, . ~ r)
# Subtract pi * r ^ 2 from the Y-axis and plot
plot(K_poisson, . - pi * r**2 ~ r)
# Compute envelopes of K under random locations
K_cluster_env <- envelope(p_cluster, Kest, correction = "border")
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
# Insert the full formula to plot K minus pi * r^2
plot(K_cluster_env, . - pi * r^2 ~ r)
# Repeat for regular data
K_regular_env <- envelope(p_regular, Kest, correction = "border")
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
plot(K_regular_env, . - pi * r^2 ~ r)
Chapter 2 - Point Pattern Analysis
Bivariate point problems:
Spatial segregation:
Space-time data:
Space-time clustering:
Example code includes:
# The dataset we shall use for this example consists of crimes in a 4km radius of the center of Preston, a town in north-west England
# We want to look for hotspots of violent crime in the area
# A ppp object called preston_crime has been constructed
# This is a marked point process, where each point is marked as either a "Violent Crime" or a "Non-violent crime"
# The marks for each point can be retrieved using the marks() function
# The window is a 4km circle centered on the town center
# A map image of the town from OpenStreetMap has also been loaded, called preston_osm
preston_crime <- readRDS("./RInputFiles/pcrime-spatstat.RDS")
preston_osm <- readRDS("./RInputFiles/osm_preston_gray.RDS")
# Get some summary information on the dataset
summary(preston_crime)
## Marked planar point pattern: 2036 points
## Average intensity 4.053214e-05 points per square unit
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Multitype:
## frequency proportion intensity
## Non-violent crime 1812 0.8899804 3.607281e-05
## Violent crime 224 0.1100196 4.459332e-06
##
## Window: polygonal boundary
## single connected closed polygon with 99 vertices
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
## Window area = 50231700 square units
## Fraction of frame area: 0.785
# Get a table of marks
table(marks(preston_crime))
##
## Non-violent crime Violent crime
## 1812 224
# Define a function to create a map
preston_map <- function(cols = c("green", "red"), cex = c(1, 1), pch = c(1, 1)) {
  raster::plotRGB(preston_osm) # from the raster package
  plot(preston_crime, cols = cols, pch = pch, cex = cex, add = TRUE, show.window = TRUE)
}
# Draw the map with colors, sizes and plot character
preston_map(
  cols = c("black", "red"),
  cex = c(0.5, 1),
  pch = 19
)
# One method of computing a smooth intensity surface from a set of points is to use kernel smoothing
# Imagine replacing each point with a dot of ink on absorbent paper
# Each individual ink drop spreads out into a patch with a dark center, and multiple drops add together and make the paper even darker
# With the right amount of ink in each drop, and with paper of the right absorbency, you can create a fair impression of the density of the original points
# In kernel smoothing jargon, this means computing a bandwidth and using a particular kernel function
# To get a smooth map of violent crimes proportion, we can estimate the intensity surface for violent and non-violent crimes, and take the ratio
# To do this with the density() function in spatstat, we have to split the points according to the two values of the marks and then compute the ratio of the violent crime surface to the total
# The function has sensible defaults for the kernel function and bandwidth to guarantee something that looks at least plausible
# preston_crime has been pre-defined
preston_crime
## Marked planar point pattern: 2036 points
## Multitype, with levels = Non-violent crime, Violent crime
## window: polygonal boundary
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
# Use the split function to show the two point patterns
crime_splits <- split(preston_crime)
# Plot the split crime
plot(crime_splits)
# Compute the densities of both sets of points
crime_densities <- density(crime_splits)
# Calc the violent density divided by the sum of both
frac_violent_crime_density <- crime_densities[[2]] /
  (crime_densities[[1]] + crime_densities[[2]])
# Plot the density of the fraction of violent crime
plot(frac_violent_crime_density)
# We can get a more principled measure of the violent crime ratio using a spatial segregation model
# The spatialkernel package implements the theory of spatial segregation
# The first step is to compute the optimal bandwidth for kernel smoothing under the segregation model
# A small bandwidth would result in a density that is mostly zero, with spikes at the event locations
# A large bandwidth would flatten out any structure in the events, resulting in a large "blob" across the whole window
# Somewhere between these extremes is a bandwidth that best represents an underlying density for the process
# spseg() will scan over a range of bandwidths and compute a test statistic using a cross-validation method
# The bandwidth that maximizes this test statistic is the one to use
# The returned value from spseg() in this case is a list, with h and cv elements giving the values of the statistic over the input h values
# The spatialkernel package supplies a plotcv function to show how the test value varies
# The hcv element has the value of the best bandwidth
# spatstat is loaded and the preston_crime object is read in
# Scan from 500m to 1000m in steps of 50m
bw_choice <- spatialkernel::spseg(
  preston_crime,
  h = seq(500, 1000, by = 50),
  opt = 1)
##
## Calculating cross-validated likelihood function
# Plot the results and highlight the best bandwidth
spatialkernel::plotcv(bw_choice)
abline(v = bw_choice$hcv, lty = 2, col = "red")
# Print the best bandwidth
print(bw_choice$hcv)
## [1] 800
# The second step is to compute the probabilities for violent and non-violent crimes as a smooth surface, as well as the p-values for a point-wise test of segregation
# This is done by calling spseg() with opt = 3 and a fixed bandwidth parameter h
# Normally you would run this process for at least 100 simulations, but that will take too long to run here
# Instead, run for only 10 simulations
# Then you can use a pre-loaded object seg which is the output from a 1000 simulation run that took about 20 minutes to complete
# Set the correct bandwidth and run for 10 simulations only
seg10 <- spatialkernel::spseg(
  pts = preston_crime,
  h = bw_choice$hcv,
  opt = 3,
  ntest = 10,
  proc = FALSE)
# Plot the segregation map for violent crime
spatialkernel::plotmc(seg10, "Violent crime")
# Plot seg, the result of running 1000 simulations (not included here)
# spatialkernel::plotmc(seg, "Violent crime")
# With a base map and some image and contour functions we can display both the probabilities and the significance tests over the area with more control than the plotmc() function.
# The seg object is a list with several components
# The X and Y coordinates of the grid are stored in the $gridx and $gridy elements
# The probabilities of each class of data (violent or non-violent crime) are in a matrix element $p with a column for each class
# The p-value of the significance test is in a similar matrix element called $stpvalue
# Rearranging columns of these matrices into a grid of values can be done with R's matrix() function
# From there you can construct list objects with a vector $x of X-coordinates, $y of Y-coordinates, and $z as the matrix
# You can then feed this to image() or contour() for visualization
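Since the seg object from the long run is not included here, the rearrangement can be sketched with made-up toy values (the grid coordinates and probabilities below are hypothetical, not the spseg() output):

```r
# Hypothetical toy grid: 3 x-coordinates, 2 y-coordinates
gridx <- c(1, 2, 3)
gridy <- c(10, 20)
# Six probabilities in column-major order, as matrix() fills them
p <- c(0.1, 0.4, 0.7, 0.2, 0.5, 0.8)
# z must be a length(x) by length(y) matrix for image() and contour()
prob <- list(x = gridx,
             y = gridy,
             z = matrix(p, nrow = length(gridx)))
image(prob)
contour(prob, add = TRUE)
```

The same list structure works for the p-value matrix, substituting a logical threshold such as `z < 0.05`.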
# This process may seem complex, but remember that with R you can always write functions to encapsulate complex tasks that you repeat often
# For example, to help with the mapping in this exercise you will create a function that builds a map from four different items
# The seg object from 1000 simulations is loaded, as well as the preston_crime points and the preston_osm map image
# Inspect the structure of the spatial segregation object
# str(seg)
# Get the number of columns in the data so we can rearrange to a grid
# ncol <- length(seg$gridx)
# Rearrange the probability column into a grid
# prob_violent <- list(x = seg$gridx,
#                      y = seg$gridy,
#                      z = matrix(seg$p[, "Violent crime"], ncol = ncol))
# image(prob_violent)
# Rearrange the p-values, but choose a p-value threshold
# p_value <- list(x = seg$gridx,
#                 y = seg$gridy,
#                 z = matrix(seg$stpvalue[, "Violent crime"] < 0.05, ncol = ncol))
# image(p_value)
# Create a mapping function
# segmap <- function(prob_list, pv_list, low, high){
#   # background map
#   plotRGB(preston_osm)
#   # p-value areas
#   image(pv_list, col = c("#00000000", "#FF808080"), add = TRUE)
#   # probability contours
#   contour(prob_list,
#           levels = c(low, high),
#           col = c("#206020", "red"),
#           labels = c("Low", "High"),
#           add = TRUE)
#   # boundary window
#   plot(Window(preston_crime), add = TRUE)
# }
#
# # Map the probability and p-value
# segmap(prob_violent, p_value, 0.05, 0.15)
# The sasquatch, or "bigfoot", is a large ape-like creature reported to live in North American forests
# The Bigfoot Field Researchers Organization maintains a database of sightings and allows its use for teaching and research
# A cleaned subset of data in north-west USA has been created as the ppp object sasq and is loaded for you to explore the space-time pattern of sightings in the area
# Get a quick summary of the dataset
sasq <- readRDS("./RInputFiles/sasquatch.RDS")
summary(sasq)
## Marked planar point pattern: 423 points
## Average intensity 2.097156e-09 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## Mark variables: date, year, month
## Summary:
## date year month
## Min. :1990-05-03 Y2004 : 41 Sep : 59
## 1st Qu.:2000-04-30 Y2000 : 36 Oct : 56
## Median :2003-11-05 Y2002 : 30 Aug : 54
## Mean :2003-08-11 Y2005 : 30 Jul : 50
## 3rd Qu.:2007-11-02 Y2001 : 26 Nov : 43
## Max. :2016-04-05 Y2008 : 26 Jun : 41
## (Other):234 (Other):120
##
## Window: polygonal boundary
## single connected closed polygon with 64 vertices
## enclosing rectangle: [368187.8, 764535.6] x [4644873, 5434933] units
## Window area = 2.01702e+11 square units
## Fraction of frame area: 0.644
# Plot unmarked points
plot(unmark(sasq))
# Plot the points using a circle sized by date
plot(sasq, which.marks = "date")
# Show the available marks
names(marks(sasq))
## [1] "date" "year" "month"
# Histogram the dates of the sightings, grouped by year
hist(marks(sasq)$date, "years", freq = TRUE)
# Plot and tabulate the calendar month of all the sightings
plot(table(marks(sasq)$month))
# Split on the month mark
sasq_by_month <- split(sasq, "month", un = TRUE)
# Plot monthly maps
plot(sasq_by_month)
# Plot smoothed versions of the above split maps
plot(density(sasq_by_month))
# To do a space-time clustering test with stmctest() from the splancs package, you first need to convert parts of your ppp object
# Functions in splancs tend to use matrix data instead of data frames.
# To run stmctest() you need to set up:
#  - event locations
#  - event times
#  - region polygon
#  - time limits
#  - the time and space ranges for analysis
# The sasq object is loaded and the spatstat and splancs packages are ready for use
# Get a matrix of event coordinates
sasq_xy <- as.matrix(spatstat::coords(sasq))
# Check the matrix has two columns
dim(sasq_xy)
## [1] 423 2
# Get a vector of event times
sasq_t <- marks(sasq)$date
# Extract a two-column matrix from the ppp object
sasq_poly <- as.matrix(as.data.frame(Window(sasq)))
dim(sasq_poly)
## [1] 64 2
# Set the time limit to 1 day before and 1 day after the range of times
tlimits <- range(sasq_t) + c(-1, 1)
# Scan over 400m intervals from 100m to 20km
s <- seq(100, 20000, by = 400)
# Scan over 14 day intervals from one week to 31 weeks
tm <- seq(7, 217, by = 14)
# Everything is now ready for you to run the space-time clustering test function
# You can then plot the results and compute a p-value for rejecting the null hypothesis of no space-time clustering
# Any space-time clustering in a data set will be removed if you randomly rearrange the dates of the data points
# The stmctest() function computes a clustering test statistic for your data based on the space-time K-function - how many points are within a spatial and temporal window of a point of the data
# It then does a number of random rearrangements of the dates among the points and computes the clustering statistic
# After doing this a large number of times, you can compare the test statistic for your data with the values from the random data
# If the test statistic for your data is sufficiently large or small, you can reject the null hypothesis of no space-time clustering
# The output from stmctest() is a list with a single t0 which is the test statistic for your data, and a vector of t from the simulations
# By converting to data frame you can feed this to ggplot functions
# Because the window area is a large number of square meters, and we have about 400 events, the numerical value of the intensity is a very small number
# This makes values of the various K-functions very large numbers, since they are proportional to the inverse of the intensity
# Don't worry if you see 10^10 or higher
# The p-value of a Monte-Carlo test like this is just the proportion of test statistics that are larger than the value from the data
# You can compute this from the t and t0 elements of the output
# All the objects from the previous exercise are loaded.
# Run 999 simulations
sasq_mc <- splancs::stmctest(sasq_xy, sasq_t, sasq_poly, tlimits, s, tm, nsim = 999, quiet = TRUE)
names(sasq_mc)
## [1] "t0" "t"
# Histogram the simulated statistics and add a line at the data value
ggplot(data.frame(sasq_mc), aes(x = t)) +
  geom_histogram(binwidth = 1e13) +
  geom_vline(aes(xintercept = t0))
# Compute the p-value as the proportion of tests greater than the data
sum(sasq_mc$t > sasq_mc$t0) / 1000
## [1] 0.04
Chapter 3 - Areal Statistics
Areal statistics:
Spatial health data:
Generalized linear models in space:
Correlation in spatial GLM:
Example code includes:
library(cartogram)
library(rgeos)
## rgeos version: 0.3-26, (SVN revision 560)
## GEOS runtime version: 3.6.1-CAPI-1.10.1 r0
## Linking to sp version: 1.2-7
## Polygon checking: TRUE
library(spdep)
## Loading required package: sp
## Loading required package: Matrix
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge')`
##
## Attaching package: 'spData'
## The following objects are masked _by_ '.GlobalEnv':
##
## x, y
library(epitools)
library(R2BayesX)
## Loading required package: BayesXsrc
## Loading required package: colorspace
##
## Attaching package: 'colorspace'
## The following object is masked from 'package:spatstat':
##
## coords
## Loading required package: mgcv
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
# In 2016 the UK held a public vote on whether to remain in the European Union
# The results of the referendum, where people voted either "Remain" or "Leave", are available online
# The data set london_ref contains the results for the 32 boroughs of London, and includes the number and percentage of votes in each category as well as the count of spoilt votes, the population size and the electorate size
# The london_ref object is a SpatialPolygonsDataFrame, a special kind of data frame where each row also has the shape of the borough
# It behaves like a data frame in many respects, but can also be used to plot a choropleth, or shaded polygon, map
# You should start with some simple data exploration and mapping. The following variables will be useful:
# NAME : the name of the borough.
# Electorate : the total number of people who can vote.
# Remain, Leave : the number of votes for "Remain" or "Leave".
# Pct_Remain, Pct_Leave : the percentage of votes for each side.
# spplot() from the sp package provides a convenient way to draw a shaded map of regions
# See what information we have for each borough
london_ref <- readRDS("./RInputFiles/london_eu.RDS")
summary(london_ref)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
## NAME TOTAL_POP Electorate Votes_Cast
## Length:32 Min. :157711 Min. : 83042 Min. : 54801
## Class :character 1st Qu.:237717 1st Qu.:143458 1st Qu.:104079
## Mode :character Median :272017 Median :168394 Median :116280
## Mean :270780 Mean :169337 Mean :118025
## 3rd Qu.:316911 3rd Qu.:196285 3rd Qu.:134142
## Max. :379691 Max. :245349 Max. :182570
## Remain Leave Rejected_Ballots Pct_Remain
## Min. : 27750 Min. :17138 Min. : 60.0 Min. :30.34
## 1st Qu.: 55973 1st Qu.:32138 1st Qu.:105.0 1st Qu.:53.69
## Median : 70254 Median :45263 Median :138.0 Median :61.01
## Mean : 70631 Mean :47255 Mean :139.0 Mean :60.46
## 3rd Qu.: 84287 3rd Qu.:59018 3rd Qu.:164.2 3rd Qu.:69.90
## Max. :118463 Max. :96885 Max. :267.0 Max. :78.62
## Pct_Leave Pct_Rejected Assembly
## Min. :21.38 Min. :0.0600 Length:32
## 1st Qu.:30.10 1st Qu.:0.0875 Class :character
## Median :38.99 Median :0.1100 Mode :character
## Mean :39.54 Mean :0.1187
## 3rd Qu.:46.31 3rd Qu.:0.1500
## Max. :69.66 Max. :0.2200
# Which boroughs voted to "Leave"?
london_ref$NAME[london_ref$Leave > london_ref$Remain]
## [1] "Sutton" "Barking and Dagenham" "Bexley"
## [4] "Havering" "Hillingdon"
# Plot a map of the percentage that voted "Remain"
sp::spplot(london_ref, zcol = "Pct_Remain")
# Large areas, such as cities or countries, are often divided into smaller administrative units, frequently zones of approximately equal population
# But the area of those units may vary considerably
# When mapping them, the large areas carry more visual "weight" than the small areas, even though just as many people live in the small areas.
# One technique for correcting for this is the cartogram
# This is a controlled distortion of the regions, expanding some and contracting others, so that the area of each region is proportional to a desired quantity, such as the population
# The cartogram also tries to maintain the correct geography as much as possible, by keeping regions in roughly the same place relative to each other
# The cartogram package contains functions for creating cartograms
# You give it a spatial data frame and the name of a column, and you get back a similar data frame but with regions distorted so that the region area is proportional to the column value of the regions
# You'll also use the rgeos package for computing the areas of individual regions with the gArea() function
# Use the cartogram and rgeos packages (called at top of routine)
# library(cartogram)
# library(rgeos)
# Make a scatterplot of electorate vs borough area
names(london_ref)
## [1] "NAME" "TOTAL_POP" "Electorate"
## [4] "Votes_Cast" "Remain" "Leave"
## [7] "Rejected_Ballots" "Pct_Remain" "Pct_Leave"
## [10] "Pct_Rejected" "Assembly"
plot(london_ref$Electorate, gArea(london_ref, byid = TRUE))
# Make a cartogram, scaling the area to the electorate
carto_ref <- cartogram(london_ref, "Electorate")
## Mean size error for iteration 1: 1.5881743190908
## Mean size error for iteration 2: 1.32100446455657
## Mean size error for iteration 3: 1.18227723476121
## Mean size error for iteration 4: 1.10676057030171
## Mean size error for iteration 5: 1.0657703107641
## Mean size error for iteration 6: 1.04259259672006
## Mean size error for iteration 7: 1.02832326230708
## Mean size error for iteration 8: 1.01931941526112
## Mean size error for iteration 9: 1.01341424685212
## Mean size error for iteration 10: 1.00941370606418
## Mean size error for iteration 11: 1.00663364742297
## Mean size error for iteration 12: 1.00470553629914
## Mean size error for iteration 13: 1.00336434720465
## Mean size error for iteration 14: 1.00241457265516
## Mean size error for iteration 15: 1.00174179254187
plot(carto_ref)
# Check the linearity of the electorate-area plot
plot(carto_ref$Electorate, gArea(carto_ref, byid = TRUE))
# Make a fairer map of the Remain percentage
sp::spplot(carto_ref, "Pct_Remain")
# The map of "Remain" votes seems to have spatial correlation
# Pick any two boroughs that are neighbors - with a shared border - and the chances are they'll be more similar than any two random boroughs
# This can be a problem when using statistical models that assume, conditional on the model, that the data points are independent
# The spdep package has functions for measures of spatial correlation, also known as spatial dependency
# Computing these measures first requires you to work out which regions are neighbors via the poly2nb() function, short for "polygons to neighbors"
# The result is an object of class nb
# Then you can compute the test statistic and run a significance test on the null hypothesis of no spatial correlation
# The significance test can either be done by Monte-Carlo or theoretical models
# In this example you'll use the Moran "I" statistic to test the spatial correlation of the population and the percentage "Remain" vote.
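As a reminder of what moran.test() is computing, the Moran I statistic has the standard form below (notation mine; here the w_ij are the spatial weights produced by nb2listw(), and x-bar is the mean of the variable):

```latex
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}}
    \cdot
    \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i} (x_i - \bar{x})^2}
```

Under the null hypothesis of no spatial correlation its expectation is -1/(n-1), which for the 32 London boroughs is -1/31, about -0.032 - the value reported as "Expectation" in the moran.test() output.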
# The london_ref spatial data object is loaded for you
# Use the spdep package (called at top of routine)
# library(spdep)
# Make neighbor list
borough_nb <- poly2nb(london_ref)
# Get center points of each borough
borough_centers <- coordinates(london_ref)
# Show the connections
plot(london_ref)
plot(borough_nb, borough_centers, add = TRUE)
# Map the total pop'n
sp::spplot(london_ref, zcol = "TOTAL_POP")
# Run a Moran I test on total pop'n
moran.test(
london_ref$TOTAL_POP,
nb2listw(borough_nb)
)
##
## Moran I test under randomisation
##
## data: london_ref$TOTAL_POP
## weights: nb2listw(borough_nb)
##
## Moran I statistic standard deviate = 1.2124, p-value = 0.1127
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.11549264 -0.03225806 0.01485190
# Map % Remain
sp::spplot(london_ref, zcol = "Pct_Remain")
# Run a Moran I MC test on % Remain
moran.mc(
london_ref$Pct_Remain,
nb2listw(borough_nb),
nsim = 999
)
##
## Monte-Carlo simulation of Moran I
##
## data: london_ref$Pct_Remain
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = 0.42841, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
# The UK's National Health Service publishes weekly data on consultations at a number of "sentinel" clinics
# A dataset for one week in February 2017 has been loaded for you to analyze
# It is called london, and contains data for the 32 boroughs.
# You will focus on reports of "Influenza-like illness", or more simply "Flu"
# Your first task is to map the "Standardized Morbidity Ratio", or SMR
# This is the number of cases per person, but scaled by the overall incidence so that the expected number is 1
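In symbols (notation mine), with O_i the observed cases and P_i the population of borough i, the calculation carried out in the code is:

```latex
r = \frac{\sum_i O_i}{\sum_i P_i}, \qquad
E_i = r \, P_i, \qquad
\mathrm{SMR}_i = \frac{O_i}{E_i}
```

So a borough whose incidence exactly matches the overall London rate has an SMR of 1.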
# The london object, a spatial data frame, and the sp package are ready for you
# Get a summary of the data set
london <- readRDS("./RInputFiles/london_2017_2.RDS")
summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
## CODE NAME Flu_OBS Vom_OBS
## Length:32 Length:32 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 11.00 1st Qu.:14.00
## Mode :character Mode :character Median : 33.00 Median :40.00
## Mean : 38.56 Mean :37.28
## 3rd Qu.: 61.00 3rd Qu.:57.50
## Max. :112.00 Max. :96.00
## Diarr_OBS Gastro_OBS TOTAL_POP CCGcode
## Min. : 0.00 Min. : 0.0 Min. :157711 Length:32
## 1st Qu.: 22.50 1st Qu.: 48.0 1st Qu.:237717 Class :character
## Median : 62.00 Median :120.5 Median :272017 Mode :character
## Mean : 57.03 Mean :113.7 Mean :270780
## 3rd Qu.: 90.75 3rd Qu.:176.8 3rd Qu.:316911
## Max. :122.00 Max. :251.0 Max. :379691
## CCG.geography.code CCG.name Asthma_Prevalence
## Length:32 Length:32 Min. :3.550
## Class :character Class :character 1st Qu.:4.412
## Mode :character Mode :character Median :4.660
## Mean :4.624
## 3rd Qu.:4.925
## Max. :5.470
## Obesity_Prevalence Cancer_Prevalence Diabetes_Prevalence Income
## Min. : 3.930 Min. :0.870 Min. :3.620 Min. :0.0730
## 1st Qu.: 5.855 1st Qu.:1.438 1st Qu.:5.265 1st Qu.:0.1308
## Median : 7.565 Median :1.605 Median :6.305 Median :0.1665
## Mean : 7.585 Mean :1.684 Mean :6.245 Mean :0.1655
## 3rd Qu.: 8.810 3rd Qu.:1.903 3rd Qu.:7.067 3rd Qu.:0.1985
## Max. :12.130 Max. :2.540 Max. :9.060 Max. :0.2530
## Employment Education HealthDeprivation Crime
## Min. :0.0570 Min. : 3.958 Min. :-1.4100 Min. :-0.1550
## 1st Qu.:0.0905 1st Qu.:10.047 1st Qu.:-0.5055 1st Qu.: 0.3745
## Median :0.1095 Median :13.925 Median :-0.2050 Median : 0.5515
## Mean :0.1092 Mean :14.024 Mean :-0.2044 Mean : 0.5379
## 3rd Qu.:0.1283 3rd Qu.:17.480 3rd Qu.: 0.2010 3rd Qu.: 0.7823
## Max. :0.1560 Max. :27.182 Max. : 0.5430 Max. : 1.0190
## Services Environment i
## Min. :19.63 Min. :13.37 Min. : 0.00
## 1st Qu.:24.43 1st Qu.:24.03 1st Qu.: 7.75
## Median :30.41 Median :28.20 Median :15.50
## Mean :29.55 Mean :31.38 Mean :15.50
## 3rd Qu.:34.74 3rd Qu.:40.15 3rd Qu.:23.25
## Max. :41.89 Max. :55.00 Max. :31.00
# Map the OBServed number of flu reports
sp::spplot(london, "Flu_OBS")
# Compute and print the overall incidence of flu
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
r
## [1] 0.0001424128
# Calculate the expected number for each borough
london$Flu_EXP <- london$TOTAL_POP * r
# Calculate the ratio of OBServed to EXPected
london$Flu_SMR <- london$Flu_OBS / london$Flu_EXP
# Map the SMR
sp::spplot(london, "Flu_SMR")
# SMRs above 1 represent high rates of disease - but how high does an SMR need to be before it can be considered statistically significant?
# Given a number of cases and a population, it's possible to use the properties of the binomial distribution to compute confidence intervals, at some chosen level, for the estimated ratio of cases per population
# The epitools package has a function binom.exact() which you can use to compute confidence intervals for the flu data
# These can be scaled to be confidence intervals on the SMR by dividing by the overall rate
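That is, if (p_lower, p_upper) is the exact binomial confidence interval for the proportion of cases in a borough, the corresponding interval on the SMR is (notation mine):

```latex
\left[ \frac{p_{\text{lower}}}{r},\; \frac{p_{\text{upper}}}{r} \right]
```

where r is the overall London rate, exactly as the ggplot() call below divides lower and upper by r.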
# The london data set and the sp package are loaded
# For the binomial statistics function (called at top of routine)
# library(epitools)
# Get CI from binomial distribution
flu_ci <- binom.exact(london$Flu_OBS, london$TOTAL_POP)
# Add borough names
flu_ci$NAME <- london$NAME
# Calculate London rate, then compute SMR
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
flu_ci$SMR <- flu_ci$proportion / r
# Subset the high SMR data
flu_high <- flu_ci[flu_ci$SMR > 1, ]
# Plot estimates with CIs
ggplot(flu_high, aes(x = NAME, y = proportion / r, ymin = lower / r, ymax = upper / r)) +
geom_pointrange() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Distributions and confidence intervals can be difficult things to present to non-statisticians
# An alternative is to present a probability that a value is over a threshold
# For example, public health teams might be interested in when an SMR has more than doubled, and as a statistician you can give a probability that this has happened
# Then the public health team might decide to go to some alert level when the probability of a doubling of SMR is over 0.95
# Again, the properties of the binomial distribution let you compute this for proportional data
# You can then map these exceedance probabilities for some threshold, and use a sensible color scheme to highlight probabilities close to 1
# The london data set has been loaded, and the expected flu case count, Flu_EXP has been computed
# Probability of a binomial exceeding a multiple
binom.exceed <- function(observed, population, expected, e){
1 - pbinom(e * expected, population, prob = observed / population)
}
# Compute P(rate > 2)
london$Flu_gt_2 <- binom.exceed(
observed = london$Flu_OBS,
population = london$TOTAL_POP,
expected = london$Flu_EXP,
e = 2)
# Use a 50-color palette that only starts changing at around 0.9
pal <- c(
rep("#B0D0B0", 40),
colorRampPalette(c("#B0D0B0", "orange"))(5),
colorRampPalette(c("orange", "red"))(5)
)
# Plot the P(rate > 2) map
sp::spplot(london, "Flu_gt_2", col.regions = pal, at = seq(0, 1, len = 50))
# A Poisson generalized linear model is a way of fitting count data to explanatory variables
# You get out parameter estimates and standard errors for your explanatory variables, and can get fitted values and residuals
# The glm() function fits Poisson GLMs. It works just like the lm() function, but you also specify a family argument
# The formula has the usual meaning - response on the left of the ~, and explanatory variables on the right
# To cope with count data coming from populations of different sizes, you specify an offset argument
# This adds a constant term for each row of the data in the model. The log of the population is used in the offset term
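With the log link used by the Poisson family, the offset term works like this (notation mine, not from the course):

```latex
\log \mathbb{E}[Y_i] = \log(\mathrm{pop}_i) + \beta_0 + \beta_1 x_i
\quad\Longleftrightarrow\quad
\mathbb{E}[Y_i] = \mathrm{pop}_i \, e^{\beta_0 + \beta_1 x_i}
```

So the fitted coefficients describe the per-person rate of cases, with each borough's population folded in as a known multiplier rather than an estimated parameter.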
# The london health data set has been loaded with the sp package also ready.
# Run a Poisson generalized linear model of how the count of flu cases, Flu_OBS, depends on the Health Deprivation index value, HealthDeprivation
# The first argument is the formula (response variable on the left)
# The family argument is a function, poisson
# Fit a poisson GLM.
model_flu <- glm(
Flu_OBS ~ HealthDeprivation,
offset = log(TOTAL_POP),
data = london,
family = "poisson")
# Is HealthDeprivation significant?
summary(model_flu)
##
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = "poisson",
## data = london, offset = log(TOTAL_POP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5361 -4.5285 -0.0499 2.9043 8.2194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.78190 0.02869 -306.043 <2e-16 ***
## HealthDeprivation 0.65689 0.06797 9.665 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 703.75 on 31 degrees of freedom
## Residual deviance: 605.03 on 30 degrees of freedom
## AIC: 762.37
##
## Number of Fisher Scoring iterations: 5
# Put residuals into the spatial data.
london$Flu_Resid <- residuals(model_flu)
# Map the residuals using spplot
sp::spplot(london, "Flu_Resid")
# A linear model should fit the data and leave uncorrelated residuals
# This applies to non-spatial models, where, for example, fitting a straight line through points on a curve would lead to serially-correlated residuals
# A model on spatial data should aim to have residuals that show no significant spatial correlation
# You can test the model fitted to the flu data using moran.mc() from the spdep package
# Monte Carlo Moran tests were previously discussed in the Spatial autocorrelation test exercise earlier in the chapter
# Compute the neighborhood structure.
borough_nb <- poly2nb(london)
# Test spatial correlation of the residuals.
moran.mc(london$Flu_Resid, listw = nb2listw(borough_nb), nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: london$Flu_Resid
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = 0.15059, observed rank = 925, p-value = 0.075
## alternative hypothesis: greater
# Bayesian statistical models return samples of the parameters of interest (the "posterior" distribution) based on some "prior" distribution which is then updated by the data
# The Bayesian modelling process returns a number of samples from which you can compute the mean, or an exceedence probability, or any other quantity you might compute from a distribution
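The underlying relationship is Bayes' rule: the posterior is proportional to the likelihood times the prior,

```latex
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
\;\propto\; p(y \mid \theta)\, p(\theta)
```

MCMC samplers such as the one behind BayesX draw samples from the posterior without ever computing the normalizing constant p(y).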
# Before you fit a model with spatial correlation, you'll first fit the same model as before, but using Bayesian inference
# The london data set has been loaded
# The R2BayesX package provides an interface to the BayesX code.
# The syntax for bayesx() is similar, but the offset has to be specified explicitly from the data frame, the family name is in quotes, and the spatial data frame needs to be turned into a plain data frame
# Run the model fitting and inspect with summary()
# Plot the samples from the Bayesian model
# On the left is the "trace" of samples in sequential order, and on the right is the parameter density
# For this model there is an intercept and a slope for the Health Deprivation score
# The parameter density should correspond with the parameter summary
# Use R2BayesX (called at top of routine)
# library(R2BayesX)
# Fit a GLM
model_flu <- glm(Flu_OBS ~ HealthDeprivation, offset = log(TOTAL_POP),
data = london, family = poisson)
# Summarize it
summary(model_flu)
##
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson,
## data = london, offset = log(TOTAL_POP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5361 -4.5285 -0.0499 2.9043 8.2194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.78190 0.02869 -306.043 <2e-16 ***
## HealthDeprivation 0.65689 0.06797 9.665 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 703.75 on 31 degrees of freedom
## Residual deviance: 605.03 on 30 degrees of freedom
## AIC: 762.37
##
## Number of Fisher Scoring iterations: 5
# Calculate coeff confidence intervals
confint(model_flu)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -8.838677 -8.7261843
## HealthDeprivation 0.524437 0.7908841
# Fit a Bayesian GLM
bayes_flu <- bayesx(Flu_OBS ~ HealthDeprivation, offset = log(london$TOTAL_POP),
family = "poisson", data = as.data.frame(london),
control = bayesx.control(seed = 17610407))
# Summarize it
summary(bayes_flu)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation, data = as.data.frame(london),
## offset = log(london$TOTAL_POP), control = bayesx.control(seed = 17610407),
## family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -8.7831 0.0278 -8.8371 -8.7841 -8.7263
## HealthDeprivation 0.6592 0.0659 0.5345 0.6587 0.7900
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# Look at the samples from the Bayesian model
plot(samples(bayes_flu))
# You've fitted a non-spatial GLM with BayesX
# You can include a spatially correlated term based on the adjacency structure by adding a term to the formula specifying a spatially correlated model
# Use poly2nb() to compute the neighborhood structure of london to an nb object
# R2BayesX uses its own objects for the adjacency. Convert the nb object to a gra object
# The sx function specifies additional terms to bayesx. Create a term using the "spatial" basis and the gra object for the boroughs to define the map
# Print a summary of the model object. You should see a table of coefficients for the parametric part of the model as in the previous exercise, and then a table of "Smooth terms variance" with one row for the spatial term
# Note that BayesX can fit many different forms in its sx terms, most of which, like the spatial model here, cannot be simply expressed with a parameter or two
# This table shows the variance of the random effects - for further explanation consult a text on random effects modelling
# Compute adjacency objects
borough_nb <- poly2nb(london)
borough_gra <- nb2gra(borough_nb)
# Fit spatial model
flu_spatial <- bayesx(
Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", map = borough_gra),
offset = log(london$TOTAL_POP),
family = "poisson", data = data.frame(london),
control = bayesx.control(seed = 17610407)
)
## Note: created new output directory 'C:/Users/Dave/AppData/Local/Temp/Rtmpa43JUL/bayesx1'!
# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial",
## map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP),
## control = bayesx.control(seed = 17610407), family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -9.2311 0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation 0.7683 0.5844 -0.3749 0.7775 1.7922
##
## Smooth terms variances:
## Mean Sd 2.5% 50% 97.5% Min Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# As with glm, you can get the fitted values and residuals from your model using the fitted and residuals functions. bayesx models are a bit more complex, since you have the linear predictor and terms from sx elements, such as the spatially correlated term
# The summary function will show you information for the linear model terms and the smoothing terms in two separate tables
# The spatial term is called "sx(i):mrf" - standing for "Markov Random Field"
# Bayesian analysis returns samples from a distribution for our S(x) term at each of the London boroughs
# The fitted function from bayesx models returns summary statistics for each borough
# You'll just look at the mean of that distribution for now
# The model from the BayesX output is available as flu_spatial.
# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial",
## map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP),
## control = bayesx.control(seed = 17610407), family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -9.2311 0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation 0.7683 0.5844 -0.3749 0.7775 1.7922
##
## Smooth terms variances:
## Mean Sd 2.5% 50% 97.5% Min Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# Map the fitted spatial term only
london$spatial <- fitted(flu_spatial, term = "sx(i):mrf")[, "Mean"]
sp::spplot(london, zcol = "spatial")
# Map the residuals
london$spatial_resid <- residuals(flu_spatial)[, "mu"]
sp::spplot(london, zcol = "spatial_resid")
# Test residuals for spatial correlation
moran.mc(london$spatial_resid, nb2listw(borough_nb), 999)
##
## Monte-Carlo simulation of Moran I
##
## data: london$spatial_resid
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = -0.26922, observed rank = 16, p-value = 0.984
## alternative hypothesis: greater
Chapter 4 - Geostatistics
Geostatistical data:
Variogram:
Kriging predictions:
Automatic kriging:
Wrap up:
Example code includes:
# Your job is to study the acidity (pH) of some Canadian survey data. The survey measurements are loaded into a spatial data object called ca_geo
# ca_geo has been pre-defined
ca_geo <- readRDS("./RInputFiles/ca_geo.RDS")
summary(ca_geo)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x 542608.7 714269.2
## y 5541290.4 5652558.9
## Is projected: TRUE
## proj4string :
## [+init=epsg:32609 +proj=utm +zone=9 +datum=WGS84 +units=m +no_defs
## +ellps=WGS84 +towgs84=0,0,0]
## Number of points: 1140
## Data attributes:
## ID Elev pH Zn
## 102I881003: 1 Min. : 5.0 Min. :3.900 Min. : 1.00
## 102I881004: 1 1st Qu.: 20.0 1st Qu.:6.100 1st Qu.: 40.00
## 102I881005: 1 Median :110.0 Median :6.600 Median : 62.00
## 102I881006: 1 Mean :183.6 Mean :6.531 Mean : 72.34
## 102I881007: 1 3rd Qu.:310.0 3rd Qu.:7.000 3rd Qu.: 88.00
## 102I881008: 1 Max. :914.0 Max. :8.700 Max. :510.00
## (Other) :1134 NA's :9 NA's :33
## Cu Pb Ni Co
## Min. : 1.00 Min. : 1.000 Min. : 1.00 Min. : 1.00
## 1st Qu.: 21.00 1st Qu.: 1.000 1st Qu.: 7.00 1st Qu.:11.00
## Median : 37.00 Median : 1.000 Median : 20.00 Median :19.00
## Mean : 57.45 Mean : 2.975 Mean : 27.85 Mean :20.16
## 3rd Qu.: 76.00 3rd Qu.: 3.000 3rd Qu.: 37.00 3rd Qu.:27.00
## Max. :2950.00 Max. :195.000 Max. :340.00 Max. :77.00
##
## Ag Mn Fe Mo
## Min. :0.1000 Min. : 2.0 Min. : 0.010 Min. : 1.000
## 1st Qu.:0.1000 1st Qu.: 490.0 1st Qu.: 4.000 1st Qu.: 1.000
## Median :0.1000 Median : 820.0 Median : 5.100 Median : 1.000
## Mean :0.1146 Mean : 959.5 Mean : 5.168 Mean : 1.654
## 3rd Qu.:0.1000 3rd Qu.:1200.0 3rd Qu.: 6.200 3rd Qu.: 2.000
## Max. :7.9000 Max. :9700.0 Max. :24.000 Max. :46.000
##
## U W Sn Hg
## Min. :-1.00 Min. :-1.00 Min. : 1.000 Min. : 5
## 1st Qu.: 0.70 1st Qu.: 1.00 1st Qu.: 1.000 1st Qu.: 60
## Median : 1.10 Median : 1.00 Median : 1.000 Median : 80
## Mean : 1.36 Mean : 1.14 Mean : 1.123 Mean : 232
## 3rd Qu.: 1.70 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 120
## Max. : 9.10 Max. :32.00 Max. :140.000 Max. :20000
##
## As Sb Ba Cd
## Min. : 1.00 Min. : 0.1000 Min. : 5 Min. : 0.100
## 1st Qu.: 5.00 1st Qu.: 0.1000 1st Qu.: 200 1st Qu.: 0.100
## Median : 6.00 Median : 0.1000 Median : 300 Median : 0.100
## Mean : 10.95 Mean : 0.2411 Mean : 301 Mean : 0.165
## 3rd Qu.: 10.00 3rd Qu.: 0.1000 3rd Qu.: 390 3rd Qu.: 0.100
## Max. :360.00 Max. :15.0000 Max. :1800 Max. :14.800
##
## V Bi Cr LoI
## Min. : 2.0 Min. :0.1000 Min. : 5.0 Min. :-1.00
## 1st Qu.:215.0 1st Qu.:0.1000 1st Qu.: 52.0 1st Qu.: 6.20
## Median :295.0 Median :0.1000 Median : 88.0 Median : 9.20
## Mean :318.3 Mean :0.1213 Mean :114.1 Mean :11.45
## 3rd Qu.:410.0 3rd Qu.:0.1000 3rd Qu.:148.0 3rd Qu.:14.00
## Max. :960.0 Max. :4.2000 Max. :860.0 Max. :82.80
##
## F Au
## Min. : 20.0 Min. : 1.00
## 1st Qu.:120.0 1st Qu.: 1.00
## Median :150.0 Median : 2.00
## Mean :164.2 Mean : 24.55
## 3rd Qu.:200.0 3rd Qu.: 5.00
## Max. :620.0 Max. :5800.00
##
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# See what measurements are at each location
names(ca_geo)
## [1] "ID" "Elev" "pH" "Zn" "Cu" "Pb" "Ni" "Co" "Ag" "Mn"
## [11] "Fe" "Mo" "U" "W" "Sn" "Hg" "As" "Sb" "Ba" "Cd"
## [21] "V" "Bi" "Cr" "LoI" "F" "Au"
# Get a summary of the acidity (pH) values
summary(ca_geo$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.900 6.100 6.600 6.531 7.000 8.700 33
# Look at the distribution
hist(ca_geo$pH)
# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
table(miss)
## miss
## FALSE TRUE
## 1107 33
# Plot a map of acidity
spplot(ca_geo[!miss, ], "pH")
# The acidity data shows pH broadly increasing from north-east to south-west. Fitting a linear model with the coordinates as covariates will interpolate a flat plane through the values
# ca_geo has been pre-defined
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# Are they called lat-long, up-down, or what?
coordnames(ca_geo)
## [1] "x" "y"
# Complete the formula
m_trend <- lm(pH ~ x + y, as.data.frame(ca_geo))
# Check the coefficients
summary(m_trend)
##
## Call:
## lm(formula = pH ~ x + y, data = as.data.frame(ca_geo))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83561 -0.32091 -0.00761 0.33188 2.06249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.358e+01 3.002e+00 27.84 <2e-16 ***
## x -5.691e-06 3.483e-07 -16.34 <2e-16 ***
## y -1.313e-05 5.319e-07 -24.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5299 on 1104 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.4237, Adjusted R-squared: 0.4227
## F-statistic: 405.9 on 2 and 1104 DF, p-value: < 2.2e-16
# Your next task is to compute the pH at the locations that have missing data in the source. You can use the predict() function on the fitted model from the previous exercise for this
# ca_geo, miss, m_trend have been pre-defined
# ls.str()
# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
# Create a data frame of missing data
ca_geo_miss <- as.data.frame(ca_geo)[miss, ]
# Predict pH for the missing data
predictions <- predict(m_trend, newdata = ca_geo_miss, se.fit = TRUE)
# Compute the exceedance probability
pAlkaline <- 1 - pnorm(7, mean = predictions$fit, sd = predictions$se.fit)
hist(pAlkaline)
# You can use the gstat package to plot variogram clouds and the variograms from data. Recall:
# The variogram cloud shows the differences of the measurements against distance for all pairs of data points
# The binned variogram divides the cloud into distance bins and computes the average difference within each bin
# The y-range of the binned variogram is always much smaller than the variogram cloud because the cloud includes the full range of values that go into computing the mean for the binned variogram
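The binned (empirical) semivariogram computed by variogram() has the standard form below (notation mine):

```latex
\hat{\gamma}(h) = \frac{1}{2\,|N(h)|}
  \sum_{(i,j) \in N(h)} \bigl( z(s_i) - z(s_j) \bigr)^2
```

where N(h) is the set of point pairs whose separation distance falls in the bin around h. The cloud plots the individual terms (z(s_i) - z(s_j))^2 / 2 before they are averaged, which is why its y-range is so much larger.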
# The acidity survey data, ca_geo and the missing value index, miss have been pre-defined
# The gstat variogram() function takes a cloud argument to compute a variogram cloud - the default is cloud = FALSE
# ca_geo, miss have been pre-defined
# ls.str()
# Make a cloud from the non-missing data up to 10km
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ], cloud = TRUE, cutoff = 10000))
# Make a variogram of the non-missing data
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ]))
# You might imagine that if soil at a particular point is alkaline, then soil one metre away is likely to be alkaline too
# But can you say the same thing about soil one kilometre away, or ten kilometres, or one hundred kilometres?
# The shape of the previous variogram tells you there is a large-scale trend in the data
# You can fit a variogram considering this trend with gstat
# This variogram should flatten out, indicating there is no more spatial correlation after a certain distance with the trend taken into account
# ca_geo, miss have been pre-defined
# ls.str()
# See what coordinates are called
coordnames(ca_geo)
## [1] "x" "y"
# The pH depends on the coordinates
ph_vgm <- gstat::variogram(pH ~ x + y, ca_geo[!miss, ])
plot(ph_vgm)
# Next you'll fit a model to your variogram
# The gstat function fit.variogram() does this
# You need to give it some initial values as a starting point for the optimization algorithm to fit a better model
# The sill is the upper limit of the model
# That is, the long-range largest value, ignoring any outliers
# A variogram has been plotted for you, and ph_vgm has been pre-defined
# Estimate some parameters by eyeballing the plot
# The nugget is the value of the semivariance at zero distance.
# The partial sill, psill is the difference between the sill and the nugget.
# Set the range to the distance at which the variogram has got about half way between the nugget and the sill
# Fit a variogram model by calling fit.variogram()
# The second argument should take the parameters you estimated, wrapped in a call to vgm()
# ca_geo, miss, ph_vgm have been pre-defined
# ls.str()
# Eyeball the variogram and estimate the initial parameters
nugget <- 0.16
psill <- 0.15
range <- 10000
# Fit the variogram
v_model <- gstat::fit.variogram(
ph_vgm,
model = gstat::vgm(
model = "Ste",
nugget = nugget,
psill = psill,
range = range,
kappa = 0.5
)
)
# Show the fitted variogram on top of the binned variogram
plot(ph_vgm, model = v_model)
print(v_model)
## model psill range kappa
## 1 Nug 0.1545116 0.00 0.0
## 2 Ste 0.1442007 14379.29 0.5
# The final part of geostatistical estimation is kriging itself
# This is the application of the variogram along with the sample data points to produce estimates and uncertainties at new locations
# The computation of estimates and uncertainties, together with the assumption of a normal (Gaussian) response means you can compute any function of the estimates - for example the probability of a new location having alkaline soil
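In its standard formulation (notation mine), the kriging predictor at a new location s_0 is a weighted average of the observations, with weights chosen from the variogram model to minimize the prediction variance:

```latex
\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i \, Z(s_i),
\qquad
P\bigl(\mathrm{pH}(s_0) > 7\bigr)
  = 1 - \Phi\!\left( \frac{7 - \hat{Z}(s_0)}{\hat{\sigma}(s_0)} \right)
```

The second expression is the Gaussian exceedance probability that the pnorm() call below computes from the kriging prediction and its variance; for universal kriging, as used here, the weights also account for the trend in the coordinates.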
# The acidity survey data, ca_geo, the missing value index, miss, and the variogram model, v_model, have been pre-defined
# ca_geo, miss, v_model have been pre-defined
# ls.str()
# Set the trend formula and the new data
km <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = ca_geo[miss, ], model = v_model)
## [using universal kriging]
names(km)
## [1] "var1.pred" "var1.var"
# Plot the predicted values
spplot(km, "var1.pred")
# Compute the probability of alkaline samples, and map
km$pAlkaline <- 1 - pnorm(7, mean = km$var1.pred, sd = sqrt(km$var1.var))
spplot(km, "pAlkaline")
# You have been asked to produce an alkaline probability map over the study area
# To do this, you are going to do some kriging via the krige() function
# This requires a SpatialPixels object which will take a bit of data manipulation to create
# You start by defining a grid, creating points on that grid, cropping to the study region, and then finally converting to SpatialPixels
# On the way, you'll meet some new functions
# GridTopology() defines a rectangular grid. It takes three vectors of length two as inputs
# The first specifies the position of the bottom left corner of the grid
# The second specifies the width and height of each rectangle in the grid, and the third specifies the number of rectangles in each direction
# To ensure that the grid and the study area have the same coordinates, some housekeeping is involved
# SpatialPoints() creates points with a given coordinate reference system (CRS), or projection (different packages use different terminology for the same concept)
# The CRS is created by wrapping the study area in projection(), then in CRS()
# For the purpose of this exercise, you don't need to worry about exactly what these functions do, only that this data manipulation is necessary to align the grid and the study area
# Now that you have that alignment, crop(), as the name suggests, crops the grid to the study area
# Finally, SpatialPixels() converts the raster cropped gridpoints to the equivalent sp object
# The acidity survey data, ca_geo, the missing value index, miss, the variogram, vgm, and the variogram model, v_model, have been pre-defined
# A rough outline of the study area is in an object called geo_bounds
# ca_geo, geo_bounds have been pre-defined
# ls.str()
# Plot the polygon and points
geo_bounds <- readRDS("./RInputFiles/ca_geo_bounds.RDS")
plot(geo_bounds)
points(ca_geo)
# Find the corners of the boundary
bbox(geo_bounds)
## min max
## x 537853.4 719269.2
## y 5536290.4 5657400.9
# Define a 2.5km square grid over the polygon extent. The first parameter is
# the bottom left corner.
grid <- GridTopology(c(537853, 5536290), c(2500, 2500), c(72, 48))
# Create points with the same coordinate system as the boundary
gridpoints <- SpatialPoints(grid, proj4string = CRS(raster::projection(geo_bounds)))
plot(gridpoints)
# Crop out the points outside the boundary
cropped_gridpoints <- raster::crop(gridpoints, geo_bounds)
plot(cropped_gridpoints)
# Convert to SpatialPixels
spgrid <- SpatialPixels(cropped_gridpoints)
coordnames(spgrid) <- c("x", "y")
plot(spgrid)
# The spatial pixel grid of the region, spgrid, and the variogram model of pH, v_model have been pre-defined
# spgrid, v_model have been pre-defined
# ls.str()
# Do kriging predictions over the grid
ph_grid <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = spgrid, model = v_model)
## [using universal kriging]
# Calc the probability of pH exceeding 7
ph_grid$pAlkaline <- 1 - pnorm(7, mean = ph_grid$var1.pred, sd = sqrt(ph_grid$var1.var))
# Map the probability of alkaline samples
spplot(ph_grid, zcol = "pAlkaline")
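The pAlkaline calculation works because a kriging prediction at a location is Gaussian, with var1.pred as its mean and var1.var as its variance. A minimal base-R sketch with made-up numbers (not taken from the survey data):

```r
# Hypothetical kriging output at one location: predicted pH 6.8,
# kriging variance 0.09, so standard error sqrt(0.09) = 0.3
pred <- 6.8
se <- sqrt(0.09)

# P(pH > 7) under the Gaussian predictive distribution
p_alkaline <- 1 - pnorm(7, mean = pred, sd = se)
round(p_alkaline, 3)  # ~0.252
```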
# The autoKrige() function in the automap package computes binned variograms, fits models, does model selection, and performs kriging by making multiple calls to the gstat functions you used previously
# It can be a great time-saver but you should always check the results carefully.
# autoKrige() can try several variogram model types
# In this exercise, you'll use a Matérn variogram model, which is commonly used in soil and forestry analyses
# You can see a complete list of available models by calling vgm() with no arguments
# The acidity survey data, ca_geo, and the missing value index, miss, have been pre-defined
# ca_geo, miss are pre-defined
# ls.str()
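To see where the "Mat" abbreviation used below comes from, you can call gstat's vgm() with no arguments, which returns the table of available variogram model abbreviations and their long names. A sketch (assumes gstat is installed):

```r
library(gstat)

# With no arguments, vgm() returns the list of available model types
# ("Nug", "Exp", "Sph", "Mat", "Ste", ...) alongside their full names
vgm()
```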
# Kriging with linear trend, predicting over the missing points
ph_auto <- automap::autoKrige(
  pH ~ x + y,
  input_data = ca_geo[!miss, ],
  new_data = ca_geo[miss, ],
  model = "Mat"
)
## [using universal kriging]
# Plot the variogram, predictions, and standard error
plot(ph_auto)
# You can also use autoKrige() over the spgrid grid from the earlier exercise
# This brings together all the concepts that you've learned in the chapter
# That is, kriging is great for predicting missing data, plotting things on a grid is much clearer than plotting individual points, and automatic kriging is less hassle than manual kriging
# The acidity survey data, ca_geo, the missing value index, miss, the spatial pixel grid of the region, spgrid, the manual kriging grid model, ph_grid, and the variogram model of pH, v_model have been pre-defined
# ca_geo, miss, spgrid, ph_grid, v_model are pre-defined
# ls.str()
# Auto-run the kriging
ph_auto_grid <- automap::autoKrige(pH ~ x + y, input_data = ca_geo[!miss, ], new_data = spgrid)
## [using universal kriging]
# Remember predictions from manual kriging
plot(ph_grid)
# Plot predictions and variogram fit
plot(ph_auto_grid)
# Compare the variogram model to the earlier one
v_model
## model psill range kappa
## 1 Nug 0.1545116 0.00 0.0
## 2 Ste 0.1442007 14379.29 0.5
ph_auto_grid$var_model
## model psill range kappa
## 1 Nug 0.1846444 0.00 0
## 2 Ste 0.1092281 13085.13 2
Chapter 1 - Vector and Raster Spatial Data in R
Reading vector and raster data into R:
Getting to know your vector data:
Getting to know your raster data:
Example code includes:
# Load the sf package
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
# Read in the trees shapefile
trees <- st_read("./RInputFiles/ZIP Files/trees/trees.shp")
## Reading layer `trees' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\trees\trees.shp' using driver `ESRI Shapefile'
## Simple feature collection with 65217 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.2546 ymin: 40.49894 xmax: -73.70078 ymax: 40.91165
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
# Read in the neighborhood shapefile
neighborhoods <- st_read("./RInputFiles/ZIP Files/neighborhoods/neighborhoods.shp")
## Reading layer `neighborhoods' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\neighborhoods\neighborhoods.shp' using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
# Read in the parks shapefile
parks <- st_read("./RInputFiles/ZIP Files/parks/parks.shp")
## Reading layer `parks' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\parks\parks.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2008 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.25609 ymin: 40.49449 xmax: -73.70905 ymax: 40.91133
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
# View the first few trees
head(trees)
## Simple feature collection with 6 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 558423 QN76 -73.80057 40.67035 0 honeylocust Queens
## 2 286191 MN32 -73.95422 40.77393 0 Callery pear Manhattan
## 3 257044 QN70 -73.92309 40.76196 0 Chinese elm Queens
## 4 603262 BK09 -73.99866 40.69312 0 cherry Brooklyn
## 5 41769 SI22 -74.11773 40.63166 0 cherry Staten Island
## 6 24024 SI07 -74.13116 40.62351 0 red maple Staten Island
## geometry
## 1 POINT (-73.80057 40.67035)
## 2 POINT (-73.95422 40.77393)
## 3 POINT (-73.92309 40.76196)
## 4 POINT (-73.99866 40.69312)
## 5 POINT (-74.11773 40.63166)
## 6 POINT (-74.13116 40.62351)
# The term "raster" refers to gridded data that can include satellite imagery, aerial photographs (like orthophotos) and other types
# In R, raster data can be handled using the raster package created by Robert J. Hijmans
# When working with raster data, one of the most important things to keep in mind is that the raw data can be what is known as "single-band" or "multi-band" and these are handled a little differently in R
# Single-band rasters are the simplest, these have a single layer of raster values -- a classic example would be an elevation raster where each cell value represents the elevation at that location
# Multi-band rasters have more than one layer. An example is a color aerial photo with one band each for red, green, and blue light.
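The single-band versus multi-band distinction can be seen without any files by building small in-memory objects (a hypothetical sketch; assumes the raster package is installed):

```r
library(raster)

# A single-band raster: one layer of cell values
r <- raster(nrows = 10, ncols = 10, vals = runif(100))

# A multi-band raster: stack three layers into a RasterBrick,
# analogous to red, green, and blue bands of an image
b <- brick(r, r * 2, r * 3)

nlayers(r)  # 1
nlayers(b)  # 3
```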
# Load the raster package
library(raster)
## Loading required package: sp
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# Read in the tree canopy single-band raster
canopy <- raster("./RInputFiles/ZIP Files/canopy/canopy.tif")
# Read in the manhattan Landsat image multi-band raster
manhattan <- brick("./RInputFiles/ZIP Files/manhattan/manhattan.tif")
# Get the class for the new objects
class(canopy)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
class(manhattan)
## [1] "RasterBrick"
## attr(,"package")
## [1] "raster"
# Identify how many layers each object has
nlayers(canopy)
## [1] 1
nlayers(manhattan)
## [1] 3
# As mentioned in the video, spatial objects in sf are just data frames with some special properties
# This means that packages like dplyr can be used to manipulate sf objects
# In this exercise, you will use the dplyr functions select() to select or drop variables, filter() to filter the data and mutate() to add or alter columns
# Load the dplyr and sf packages
# library(dplyr)
# library(sf)
# Read in the trees shapefile (already read in above)
# trees <- st_read("trees.shp")
# Use filter() to limit to honey locust trees
honeylocust <- trees %>% filter(species == "honeylocust")
# Count the number of rows
nrow(honeylocust)
## [1] 6418
# Limit to tree_id and boroname variables
honeylocust_lim <- honeylocust %>% dplyr::select(tree_id, boroname)
# Use head() to look at the first few records
head(honeylocust_lim)
## Simple feature collection with 6 features and 2 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -73.97524 ymin: 40.67035 xmax: -73.80057 ymax: 40.83136
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id boroname geometry
## 1 558423 Queens POINT (-73.80057 40.67035)
## 2 548625 Brooklyn POINT (-73.97524 40.68575)
## 3 549439 Brooklyn POINT (-73.94137 40.67557)
## 4 181757 Brooklyn POINT (-73.89377 40.67169)
## 5 379387 Queens POINT (-73.8221 40.69365)
## 6 383562 Bronx POINT (-73.90041 40.83136)
# In this exercise, you will convert the data frame to what's called a tibble with tibble::as_tibble() (Note that dplyr::tbl_df() is now deprecated)
# tibble is loaded in your workspace
# Create a standard, non-spatial data frame with one column
df <- data.frame(a = 1:3)
# Add a list column to your data frame
df$b <- list(1:4, 1:5, 1:10)
# Look at your data frame with head
head(df)
## a b
## 1 1 1, 2, 3, 4
## 2 2 1, 2, 3, 4, 5
## 3 3 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# Convert your data frame to a tibble and print on console
as_tibble(df)
## # A tibble: 3 x 2
## a b
## <int> <list>
## 1 1 <int [4]>
## 2 2 <int [5]>
## 3 3 <int [10]>
# Pull out the third observation from both columns individually
df$a[3]
## [1] 3
df$b[3]
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
# There are several functions in sf that allow you to access geometric information like area from your vector features
# For example, the functions st_area() and st_length() return the area and length of your features, respectively
# Note that the result of functions like st_area() and st_length() will not be a traditional vector
# Instead the result has class units, meaning the vector is accompanied by metadata describing its units
# To compare such a result against a plain number val, you either need to remove the units with unclass()
# or give val matching units with units(val) <- units(result)
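The comparison problem described above can be sketched with the units package directly; here set_units() stands in for an st_area() result and val is a hypothetical threshold:

```r
library(units)

val <- 30000                      # plain number, intended as square metres
result <- set_units(45000, m^2)   # stands in for an st_area() result

# Option 1: strip the units metadata before comparing
unclass(result) > val

# Option 2: give val matching units, then compare directly
units(val) <- units(result)
result > val
```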
# sf and dplyr are loaded in your workspace
# Read in the parks shapefile (already read in above)
# parks <- st_read("parks.shp")
# Compute the areas of the parks
areas <- st_area(parks)
# Create a quick histogram of the areas using hist
hist(areas, xlim = c(0, 200000), breaks = 1000)
# Filter to parks greater than 30000 (square meters)
big_parks <- parks %>% filter(unclass(areas) > 30000)
# Plot just the geometry of big_parks
plot(st_geometry(big_parks))
# Plot the parks object using all defaults
plot(parks)
## Warning: plotting the first 9 out of 14 attributes; use max.plot = 14 to
## plot all
# Plot just the acres attribute of the parks data
plot(parks["acres"])
# Create a new object of just the parks geometry
parks_geo <- st_geometry(parks)
# Plot the geometry of the parks data
plot(parks_geo)
# Load the raster package
# library(raster)
# Read in the rasters (done previously)
# canopy <- raster("canopy.tif")
# manhattan <- brick("manhattan.tif")
# Get the extent of the canopy object
extent(canopy)
## class : Extent
## xmin : 1793685
## xmax : 1869585
## ymin : 2141805
## ymax : 2210805
# Get the CRS of the manhattan object
crs(manhattan)
## CRS arguments:
## +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Determine the number of grid cells in both raster objects
ncell(manhattan)
## [1] 619173
ncell(canopy)
## [1] 58190
# Raster data can be very big depending on the extent and resolution (grid size)
# In order to deal with this the raster() and brick() functions are designed to only read in the actual raster values as needed
# To show that this is true, you can use the inMemory() function on an object and it will return FALSE if the values are not in memory
# If you use the head() function, the raster package will read in only the values needed, not the full set of values
# The raster values will be read in by default if you perform spatial analysis operations that require it or you can read in the values from a raster manually with the function getValues()
graphics.off()
# Check if the data is in memory
inMemory(canopy)
## [1] FALSE
# Use head() to peek at the first few records
head(canopy)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 0.00 19.35 47.88 17.17 54.27 70.93 81.18 84.23 88.86 87.17 82.27 81.65
## 2 0.00 10.65 58.61 28.77 51.19 53.65 85.29 88.81 89.00 84.59 79.00 87.18
## 3 0.00 0.00 17.26 28.72 49.04 43.84 76.13 83.78 88.30 76.47 84.44 69.35
## 4 0.00 0.96 23.81 64.48 38.24 36.16 79.26 87.02 83.80 70.21 34.33 16.14
## 5 0.00 6.97 38.18 81.83 48.02 52.84 71.18 87.21 76.72 72.90 7.12 47.38
## 6 14.89 31.42 34.17 51.29 85.26 70.05 74.99 83.52 83.98 75.74 41.93 84.72
## 7 65.06 37.87 30.91 23.92 35.21 53.85 85.32 85.59 85.63 76.34 77.72 81.97
## 8 68.95 43.38 37.51 22.02 27.54 72.25 84.80 86.20 86.14 84.44 82.56 67.32
## 9 58.23 33.00 43.03 12.07 19.64 76.00 76.35 76.53 83.94 85.48 83.76 41.86
## 10 46.31 53.63 23.67 10.73 48.16 60.86 63.47 69.98 61.79 55.78 34.47 49.32
## 13 14 15 16 17 18 19 20
## 1 77.95 80.72 64.98 80.05 66.04 34.45 28.03 54.67
## 2 73.62 84.57 72.14 83.08 72.75 13.29 33.37 42.57
## 3 66.02 83.02 72.60 77.81 60.89 49.96 27.29 41.00
## 4 72.36 82.51 72.86 78.84 70.87 30.37 47.14 55.99
## 5 75.38 87.18 74.90 81.76 60.04 24.07 45.37 52.69
## 6 85.36 84.52 65.53 68.28 59.58 52.38 31.21 33.98
## 7 65.10 63.87 48.16 30.90 39.24 29.74 5.78 14.39
## 8 6.58 51.03 21.22 40.49 29.76 22.16 6.76 46.05
## 9 9.47 23.15 50.56 44.56 19.28 32.92 52.94 43.42
## 10 8.55 21.59 55.36 54.16 45.76 47.89 59.27 47.48
# Use getValues() to read the values into a vector
vals <- getValues(canopy)
# Use hist() to create a histogram of the values
hist(vals)
# The raster package has added useful methods for plotting both single and multi-band rasters
# For single-band rasters or for a map of each layer in a multi-band raster you can simply use plot()
# If you have a multi-band raster with layers for red, green and blue light you can use the plotRGB() function to plot the raster layers together as a single image
# Plot the canopy raster (single raster)
plot(canopy)
# Plot the manhattan raster (as a single image for each layer)
plot(manhattan)
# Plot the manhattan raster as an image
plotRGB(manhattan)
# raster masks dplyr::select
detach("package:raster")
Chapter 2 - Preparing Layers for Spatial Analysis
Quick refresher on coordinate reference systems (CRS):
Manipulating vector layers with dplyr:
Converting sf objects into sp objects and coordinates:
Manipulating raster layers:
Example code includes:
library(sf)
library(raster)
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# In order to perform any spatial analysis with more than one layer, your layers should share the same coordinate reference system (CRS) and the first step is determining what coordinate reference system your data has
# To do this you can make use of the sf function st_crs() and the raster function crs()
# When the geographic data you read in with sf already has a CRS defined both sf and raster will recognize and retain it
# When the CRS is not defined you will need to define it yourself using either the EPSG number or the proj4string
# Determine the CRS for the neighborhoods and trees vector objects
st_crs(neighborhoods)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
# Assign the CRS to trees
crs_1 <- "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees) <- crs_1
# Determine the CRS for the canopy and manhattan rasters
crs(canopy)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
crs(manhattan)
## CRS arguments:
## +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Assign the CRS to manhattan
crs_2 <- "+proj=utm +zone=18 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
crs(manhattan) <- crs_2
# In this exercise you will transform (sometimes this is called "project") the objects so they share a single CRS
# It is generally best to perform spatial analysis with layers that have a projected CRS (and some functions require this)
# To determine if your object has a projected CRS you can look at the first part of the result from st_crs() or crs() -- if it begins with +proj=longlat then your CRS is unprojected
# Note that you will use method = "ngb" (nearest-neighbor) in your call to projectRaster() so cell values are not averaged during resampling, which would blur the manhattan image
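The "begins with +proj=longlat" check described above can be wrapped in a tiny helper (a hypothetical convenience function, not part of sf or raster):

```r
# Returns TRUE if a proj4string describes a projected CRS,
# FALSE if it is unprojected (longitude/latitude)
is_projected <- function(p4s) {
  !grepl("^\\+proj=longlat", p4s)
}

is_projected("+proj=longlat +ellps=WGS84 +no_defs")       # FALSE
is_projected("+proj=utm +zone=18 +datum=WGS84 +units=m")  # TRUE
```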
# Get the CRS from the canopy object
the_crs <- crs(canopy, asText = TRUE)
# Project trees to match the CRS of canopy
trees_crs <- st_transform(trees, crs = the_crs)
# Project neighborhoods to match the CRS of canopy
neighborhoods_crs <- st_transform(neighborhoods, crs = the_crs)
# Project manhattan to match the CRS of canopy
manhattan_crs <- projectRaster(manhattan, crs = the_crs, method = "ngb")
# Look at the CRS to see if they match
st_crs(trees_crs)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
st_crs(neighborhoods_crs)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
crs(manhattan_crs)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# If the layers do not share a common CRS they may not align on a plot
# To illustrate, in this exercise, you will initially create a plot with the plot() function and try to add two layers that do not share the same CRS
# You will then transform one layer's CRS to match the other and you will plot this with both the plot() function and functions from the tmap package.
# Note that for this exercise we returned all the layers to their original CRS and did not retain the changes you made in the last exercise
# With the plot() function you can plot multiple layers on the same map by calling plot() multiple times
# You'll need to add the argument add = TRUE to all calls to plot() after the first one and you need to run the code for all layers at once rather than line-by-line
# Plot canopy and neighborhoods (run both lines together)
# Do you see the neighborhoods?
plot(canopy)
plot(neighborhoods$geometry, add = TRUE)
# See if canopy and neighborhoods share a CRS
st_crs(neighborhoods)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
crs(canopy)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# Save the CRS of the canopy layer
the_crs <- crs(canopy, asText = TRUE)
# Transform the neighborhoods CRS to match canopy
neighborhoods_crs <- st_transform(neighborhoods, crs = the_crs)
# Re-run plotting code (run both lines together)
# Do the neighborhoods show up now?
plot(canopy)
plot(neighborhoods_crs$geometry, add = TRUE)
# Simply run the tmap code
tmap::tm_shape(canopy) +
  tmap::tm_rgb() +
  tmap::tm_shape(neighborhoods_crs) +
  tmap::tm_polygons(alpha = 0.5)
# One of the great innovations of sf over sp is the use of data frames for storing spatial objects
# This allows you to slice and dice your spatial data in the same way you do for non-spatial data
# This means you can, for example, apply dplyr verbs directly to your sf object
# One important difference when using dplyr on spatial data is that the resulting data frames will include the geometry variable unless you explicitly drop it
# If you want to force the geometry to be dropped you would use the sf function st_set_geometry() and you would set the geometry to NULL
# The packages sf and dplyr, and the object trees are loaded in your workspace
# Create a data frame of counts by species
species_counts <- count(trees, species)
# Arrange in descending order
species_counts_desc <- arrange(species_counts, desc(n))
# Use head to see if the geometry column is in the data frame
head(species_counts_desc)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.25443 ymin: 40.49894 xmax: -73.70104 ymax: 40.91165
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
## species n geometry
## <fct> <int> <sf_geometry [degree]>
## 1 London planetree 8709 MULTIPOINT (-74.25408 40.50...
## 2 honeylocust 6418 MULTIPOINT (-74.25426 40.50...
## 3 Callery pear 5902 MULTIPOINT (-74.25443 40.50...
## 4 pin oak 5355 MULTIPOINT (-74.25329 40.50...
## 5 Norway maple 3373 MULTIPOINT (-74.25443 40.50...
## 6 littleleaf linden 3043 MULTIPOINT (-74.25032 40.51...
# Drop the geometry column
species_no_geometry <- st_set_geometry(species_counts_desc, NULL)
# Confirm the geometry column has been dropped
head(species_no_geometry)
## # A tibble: 6 x 2
## species n
## <fct> <int>
## 1 London planetree 8709
## 2 honeylocust 6418
## 3 Callery pear 5902
## 4 pin oak 5355
## 5 Norway maple 3373
## 6 littleleaf linden 3043
# In this exercise you will test joining spatial and non-spatial data. In particular, the trees data you have been working with has a full county name (the variable is called boroname) but does not have the county codes. The neighborhoods file has both a county name (the variable is called boro_name) and the county codes -- neighborhoods are nested within counties
# In this exercise, you will create a non-spatial data frame of county name and county code from the neighborhoods object
# Then you will join this data frame into the spatial trees object with inner_join()
# The packages sf and dplyr and the objects neighborhoods and trees are loaded in your workspace
# Limit to the fields boro_name, county_fip and boro_code
boro <- dplyr::select(neighborhoods, boro_name, county_fip, boro_code)
# Drop the geometry column
boro_no_geometry <- st_set_geometry(boro, NULL)
# Limit to distinct records
boro_distinct <- distinct(boro_no_geometry)
# Join the county detail into the trees object
trees_with_county <- inner_join(trees, boro_distinct, by = c("boroname" = "boro_name"))
# Confirm the new fields county_fip and boro_code exist
head(trees_with_county)
## Simple feature collection with 6 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 558423 QN76 -73.80057 40.67035 0 honeylocust Queens
## 2 286191 MN32 -73.95422 40.77393 0 Callery pear Manhattan
## 3 257044 QN70 -73.92309 40.76196 0 Chinese elm Queens
## 4 603262 BK09 -73.99866 40.69312 0 cherry Brooklyn
## 5 41769 SI22 -74.11773 40.63166 0 cherry Staten Island
## 6 24024 SI07 -74.13116 40.62351 0 red maple Staten Island
## county_fip boro_code geometry
## 1 081 4 POINT (-73.80057 40.67035)
## 2 061 1 POINT (-73.95422 40.77393)
## 3 081 4 POINT (-73.92309 40.76196)
## 4 047 3 POINT (-73.99866 40.69312)
## 5 085 5 POINT (-74.11773 40.63166)
## 6 085 5 POINT (-74.13116 40.62351)
# In sf you can use the st_simplify() function to reduce line and polygon complexity
# In this exercise you will measure the size of objects before and after st_simplify() in two ways
# You will compute the size in bytes with object.size() (the course uses pryr's object_size()) and you will count the number of vertices -- the number of points required to delineate a line or polygon
# The packages sf and pryr are loaded in your workspace
# Plot the neighborhoods geometry
plot(st_geometry(neighborhoods), col = "grey")
# Measure the size of the neighborhoods object
utils::object.size(neighborhoods)
## 1890408 bytes
# Compute the number of vertices in the neighborhoods object
pts_neighborhoods <- st_cast(neighborhoods$geometry, "MULTIPOINT")
cnt_neighborhoods <- sapply(pts_neighborhoods, length)
sum(cnt_neighborhoods)
## [1] 210736
# Simplify the neighborhoods object
neighborhoods_simple <- st_simplify(neighborhoods,
                                    preserveTopology = TRUE,
                                    dTolerance = 0.0025)
## Warning in st_simplify.sfc(st_geometry(x), preserveTopology, dTolerance):
## st_simplify does not correctly simplify longitude/latitude data, dTolerance
## needs to be in decimal degrees
# Measure the size of the neighborhoods_simple object
utils::object.size(neighborhoods_simple)
## 248464 bytes
# Compute the number of vertices in the neighborhoods_simple object
pts_neighborhoods_simple <- st_cast(neighborhoods_simple$geometry, "MULTIPOINT")
cnt_neighborhoods_simple <- sapply(pts_neighborhoods_simple, length)
sum(cnt_neighborhoods_simple)
## [1] 4766
# Plot the neighborhoods_simple object geometry
plot(st_geometry(neighborhoods_simple), col = "grey")
# Read in the trees data (done previously)
# trees <- st_read("trees.shp")
# Convert to Spatial class
trees_sp <- as(trees, Class = "Spatial")
# Confirm conversion, should be "SpatialPointsDataFrame"
class(trees_sp)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Convert back to sf
trees_sf <- st_as_sf(trees_sp)
# Confirm conversion
class(trees_sf)
## [1] "sf" "data.frame"
# In order to convert a data frame of coordinates into an sf object you can make use of the st_as_sf() function you used in the previous exercise
# You can specify the coords argument with the names of the coordinate variables (with the X coordinate/longitude coordinate listed first) and, optionally, the crs argument if you know the CRS of your coordinates
# The CRS can be specified as a proj4 string or EPSG code
# If you want to convert your sf point objects to a data frame with coordinates, you can use the st_write() function with a
# hidden argument (these are arguments associated with an external utility called GDAL and so they're not in the R help) to force sf to include the coordinates in the output file
# The argument you need is layer_options = "GEOMETRY=AS_XY"
# Read in the CSV (done previously)
# trees <- read.csv("trees.csv")
# Convert the data frame to an sf object
trees_sf <- st_as_sf(trees, coords = c("longitude", "latitude"), crs = 4326)
# Plot the geometry of the points
plot(st_geometry(trees_sf))
# Write the file out with coordinates
st_write(trees_sf, "./RInputFiles/new_trees.csv", layer_options = "GEOMETRY=AS_XY", delete_dsn = TRUE)
## Deleting source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## Writing layer `new_trees' to data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## options: GEOMETRY=AS_XY
## features: 65217
## fields: 7
## geometry type: Point
# Read in the file you just created and check coordinates
new_trees <- read.csv("./RInputFiles/new_trees.csv")
head(new_trees)
## X Y tree_id nta longitude latitude stump_diam
## 1 -73.80057 40.67035 558423 QN76 -73.80057 40.67035 0
## 2 -73.95422 40.77393 286191 MN32 -73.95422 40.77393 0
## 3 -73.92309 40.76196 257044 QN70 -73.92309 40.76196 0
## 4 -73.99866 40.69312 603262 BK09 -73.99866 40.69312 0
## 5 -74.11773 40.63166 41769 SI22 -74.11773 40.63166 0
## 6 -74.13116 40.62351 24024 SI07 -74.13116 40.62351 0
## species boroname
## 1 honeylocust Queens
## 2 Callery pear Manhattan
## 3 Chinese elm Queens
## 4 cherry Brooklyn
## 5 cherry Staten Island
## 6 red maple Staten Island
# Read in the canopy layer (done previously)
# canopy <- raster("canopy.tif")
# Plot the canopy raster
plot(canopy)
# Determine the raster resolution
res(canopy)
## [1] 300 300
# Determine the number of cells
ncell(canopy)
## [1] 58190
# Aggregate the raster
canopy_small <- aggregate(canopy, fact = 10)
# Plot the new canopy layer
plot(canopy_small)
# Determine the new raster resolution
res(canopy_small)
## [1] 3000 3000
# Determine the number of cells in the new raster
ncell(canopy_small)
## [1] 598
# Plot the canopy layer to see the values above 100
plot(canopy)
# Set up the reclassification matrix: values from 100 to 300 become NA
vals <- cbind(100, 300, NA)  # columns are: from, to, becomes
# Reclassify
canopy_reclass <- reclassify(canopy, rcl = vals)
# Plot again and confirm that the legend stops at 100
plot(canopy_reclass)
# raster masks dplyr::select
detach("package:raster")
Chapter 3 - Conducting Spatial Analysis with sf and raster